Frequency-Based Matching in the Fellegi-Sunter Model of Record Linkage

Author

  • William E. Winkler
Abstract

1. INTRODUCTION

As a special case of their general theory of record linkage, Fellegi and Sunter (1969) presented a formal model for matching that uses the relative frequency of strings being compared. For instance, a surname that is relatively rare in pairs of records taken from two files has more distinguishing power than a common one. Most applications of frequency-based matching have used close variants of the basic model but have made different simplifying assumptions that reduce computation and facilitate table building.

This paper introduces an extended methodology under weaker assumptions. While the amount of computation is significantly increased (as much as an order of magnitude), the need for expert human intervention is reduced. Most or all of the matching parameters can be automatically computed using file characteristics alone. The methodology does not require calibration data sets on which true match status has been determined. No a priori assumptions about parameters or previously created lookup tables are needed. Relative frequency tables are more suitable for situations when one list cannot be assumed a near subset of another. When one list is a near subset of the other and a number of other simplifying assumptions are made, the new method yields tables comparable to those obtained via previous methods. If the matching is performed on a subset of pairs (such as those agreeing on Soundex code of surname or on specific geographic identifiers), then adjustments of the parameters and decision rules to the subsets are also automatic.

In the second section of the paper, background on the Fellegi-Sunter model of record linkage is presented. The third section is divided into five parts. The first contains the basic theory for the new frequency-based methods. The theory holds for all pairs in the product space of two files. In the second, a method of adjusting for typographical variation is given. The method partially accounts for the fact that observed frequencies do not necessarily correspond to true frequencies. The third part shows how matching decision rules can utilize both frequency-based weights and simpler agree/disagree weights obtained via the Expectation-Maximization (EM) Algorithm (Winkler 1988, 1989a,c; Thibaudeau 1989). As the EM-derived weights are sometimes obtained on subsets of pairs such as those agreeing on geographical subregions, two methods for adjusting the frequency-based weights to subsets are given. The fourth part contains empirical results for a comparison of files having substantial amounts of accurate information. In the …
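The idea that a rare surname carries more distinguishing power than a common one can be illustrated with a minimal Python sketch. It is not the extended method of the paper. It assumes a simplified frequency model with no typographical error, in which the number of true matches carrying a given value is the smaller of its two file counts, and in which the chance of a random pair agreeing on that value is approximated over all pairs in the product space. The function name `frequency_agreement_weights` and the sample surnames are hypothetical.

```python
from collections import Counter
from math import log2


def frequency_agreement_weights(values_a, values_b):
    """Return a log2 agreement weight for each value found in both files.

    Simplifying assumptions of this sketch (not taken from the paper):
      * no typographical variation in either file;
      * the number of true matches with value v is min(count_a[v], count_b[v]);
      * the agreement probability among non-matches is approximated by the
        agreement probability among all pairs in the product space.
    """
    count_a = Counter(values_a)
    count_b = Counter(values_b)
    n_pairs = len(values_a) * len(values_b)

    # Assumed number of matched pairs per shared value.
    matched = {v: min(count_a[v], count_b[v]) for v in count_a if v in count_b}
    n_matches = sum(matched.values())

    weights = {}
    for value, h in matched.items():
        m = h / n_matches                              # P(agree on value | match)
        u = count_a[value] * count_b[value] / n_pairs  # P(agree on value | random pair)
        weights[value] = log2(m / u)
    return weights


if __name__ == "__main__":
    file_a = ["Smith"] * 6 + ["Jones"] * 3 + ["Zabrinsky"]
    file_b = ["Smith"] * 5 + ["Jones"] * 4 + ["Zabrinsky"]
    ranked = sorted(frequency_agreement_weights(file_a, file_b).items(),
                    key=lambda item: item[1], reverse=True)
    for surname, weight in ranked:
        print(f"{surname}: {weight:.2f}")
```

Under these assumptions the rarest shared surname receives the largest agreement weight. Values that occur in only one file get no weight, since no pair can agree on them.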

Similar resources

BUREAU OF THE CENSUS STATISTICAL RESEARCH DIVISION Statistical Research Report Series No. RR2000/06 Frequency-Based Matching in Fellegi-Sunter Model of Record Linkage

This paper extends techniques for frequency-based matching (see e.g., Fellegi and Sunter 1969). The extended techniques allow table-building under weaker assumptions than those typically used in practice. Although CPU requirements can increase, human intervention can be reduced in some situations.

Data Cleaning Methods

Data Cleaning methods are used for finding duplicates within a file or across sets of files. This overview provides background on the Fellegi-Sunter model of record linkage. The Fellegi-Sunter model provides an optimal theoretical classification rule. Fellegi and Sunter introduced methods for automatically estimating optimal parameters without training data that we extend to many real world sit...

Approaches to Multiple Record Linkage

We review the theory and techniques of record linkage that date back to pioneering work by Fellegi and Sunter on matching records in two lists. When the task involves linking K > 2 lists, the most common approach consists of performing all $\binom{K}{2}$ possible pairs of lists using a Fellegi-Sunter-like approach and then somehow reconciling the discrepancies in an ad hoc fashion. We describe some im...

The State of Record Linkage and Current Research Problems

This paper provides an overview of methods and systems developed for record linkage. Modern record linkage begins with the pioneering work of Newcombe and is especially based on the formal mathematical model of Fellegi and Sunter. In their seminal work, Fellegi and Sunter introduced many powerful ideas for estimating record linkage parameters and other ideas that still influence record linkage ...

A Generalized Fellegi-Sunter Framework for Multiple Record Linkage With Application to Homicide Record-Systems

We present a probabilistic method for linking multiple datafiles. This task is not trivial in the absence of unique identifiers for the individuals recorded. This is a common scenario when linking census data to coverage measurement surveys for census coverage evaluation, and in general when multiple record–systems need to be integrated for posterior analysis. Our method generalizes the Fellegi...

Publication date: 2002